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Abstract 

We survey incremental methods for minimizing a sum consisting of a large number of convex 

component functions fi. Our methods consist of iterations applied to single components, and have proved 
very effective in practice. We introduce a unified algorithmic framework for a variety of such methods, 
some involving gradient and subgradient iterations, which are known, and some involving combinations 
of subgradient and proximal methods, which are new and offer greater flexibility in exploiting the special 
structure of fi. We provide an analysis of the convergence and rate of convergence properties of these 
methods, including the advantages offered by randomization in the selection of components. We also survey 
applications in inference/machine learning, signal processing, and large-scale and distributed optimization. 


1. INTRODUCTION 


We consider optimization problems with a cost function consisting of a large number of component functions, 
such as 

m 


minimize E Mx) 
subject to X G X, 


( 1 . 1 ) 


where fi : 3?" i-)- 5R, i = 1,..., m, are real-valued functions, and A is a closed convex set.) We focus on 
the case where the number of components m is very large, and there is an incentive to use incremental 
methods that operate on a single component fi at each iteration, rather than on the entire cost function. If 
each incremental iteration tends to make reasonable progress in some “average” sense, then depending on the 
value of m, an incremental method may significantly outperform (by orders of magnitude) its nonincremental 
counterpart, as experience has shown. 


1 This is an extended version of similarly titled papers that will appear in Math. Programming Journal, and the 
edited volume Optimization for Machine Learning (S. Sra, S. Nowozin, and S. Wright, Eds.), MIT Press, 2012. 

2 The author is with the Dept, of Electr. Engineering and Comp. Science, M.I.T., Cambridge, Mass., 02139. His 
research was supported by the AFOSR under Grant FA9550-10-1-0412. Thanks are due to Huizhen (Janey) Yu for 
extensive helpful discussions and suggestions. Comments by Angelia Nedic and Ben Recht are also appreciated. 

f Throughout the paper, we will operate within the n-dimensional space IR” with the standard Euclidean norm, 
denoted || • ||. All vectors are considered column vectors and a prime denotes transposition, so x'x = ||a:|p. We will be 
using standard terminology of convex optimization, as given for example in textbooks such as Rockafellar’s [Roc70], 
or the author’s recent book [Ber09]. 
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In this paper, we survey the algorithmic properties of incremental methods in a unified framework, 
based on the author’s recent work on incremental proximal methods [BerlO] (an early version appears in 
the supplementary algorithms chapter of the book [Ber09]). In this section, we first provide an overview of 
representative applications, and then we discuss three types of incremental methods: gradient, subgradient, 
and proximal. We unify these methods, into a combined method, which we use as a vehicle for analysis later. 

1.1 Some Examples of Additive Cost Problems 

Additive cost problems of the form (1.1) arise in a variety of contexts. Let us provide a few examples where 
the incremental approach may have an advantage over alternatives. 

Example 1.1: (Least Squares and Related Inference Problems) 

An important context where cost functions of the form /»(*) arise is inference/machine learning, where 

each term fi{x) corresponds to error between some data and the output of a parametric model, with x being 
the vector of parameters. An example is linear least squares problems, where fi has quadratic structure, except 
for a regularization function. The latter function may be differentiable/quadratic, as in the classical regression 
problem 

m 

minimize ^~~^(a(a: — bj)^ + ^\\x — x\\^ 

i = l 

subject to a: e K", 

where x is given, or nondifferentiable, as in the l\-regularization problem 

m n 

minimize '^{a'iX - bif + 7 E \Xj\ 

i=l 

subject to a: = (a:i,..., Xn) € 5R”, 
which will be discussed further in Section 5. 

A more general class of additive cost problems is nonlinear least squares. Here 

fi{x) = (hi(a;))^ 

where hi(x) represents the difference between the ith of m measurements from a physical system and the output 
of a parametric model whose parameter vector is x. Problems of nonlinear curve fitting and regression, as well 
as problems of training neural networks fall in this category, and they are typically nonconvex. 

Another possibility is to use a nonquadratic function to penalize the error between some data and the 
output of the parametric model. For example in place of the squared error {a[x — bi)^, we may use 

fi{x) = liia'iX - hi), 

where £i is a convex function. This is a common approach in robust estimation and some support vector 
machine formulations. 

Still another example is maximum likelihood estimation, where fi is of the form 

fi{x) = -\ogPY{yi-,x), 

and yi,..., ym represent values of independent samples of a random vector whose distribution Py{-', x) depends 
on an unknown parameter vector x £ 5R” that we wish to estimate. Related contexts include “incomplete” data 
cases, where the expectation-maximization (EM) approach is used. 

The following four examples deal with broadly applicable problem structures that give rise to additive 
cost functions. 
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Example 1.2: (Dual Optimization in Separable Problems) 


Consider the problem 

m 

maximize E CiiVi) 

i=l 


subject to ^ giiVi) > 0 , Vi €Yi, i = 1,... ,m, 

i=l 

where Ci : i—>■ K and pi : i—>■ 5ft" are functions of a vector yi £ 5ft^, and Yi are given sets of 5ft^. Then by 
assigning a dual vector/multiplier x G 5ft" to the n-dimensional constraint function, we obtain the dual problem 


where 


minimize E fi{x) 

i=l 

subject to X > 0 , 


fiix)= sup {ci{yi) + x giivi)} , 

Vi^Yi 


which has the additive form (1.1). Here Yi is not assumed convex, so integer programming and other discrete 
optimization problems are included. However, the dual cost function components fi are always convex, and 
their values and subgradients can often be conveniently computed, particularly when yi is a scalar or Yi is a 
finite set. 


Example 1.3: (Problems with Many Constraints) 


Problems of the form 


minimize f{x) 

subject to gj(x) < 0, j = 1,... ,r, x e X, 

where the number r of constraints is very large often arise in practice, either directly or via 
other problems. They can be handled in a variety of ways. One possibility is to adopt 
approach, and replace problem ( 1 . 2 ) with 


( 1 . 2 ) 

reformulation from 
a penalty function 


minimize f{x) + c E P{gj{x)) 


(1.3) 


subject to x £ X, 

where P{-) is a scalar penalty function satisfying P{t) = 0 if t < 0, and P{t) > 0 if t > 0, and c is a 
positive penalty parameter. For example, one may use the quadratic penalty P{t) = (max{0, t}) , or the 
nondifferentiable penalty P{t) — max{0, t}. In the latter case, it can be shown that the optimal solutions of 
problems (1.2) and (1.3) coincide when c is sufficiently large (see for example [BNO03], Section 7.3, for the case 
where / is convex). The cost function of the penalized problem (1.3) is of the additive form (1.1). 


Set constraints of the form x G njTiXi, where Xi are closed sets, can also be handled by penalties in a 
way that gives rise to additive cost functions (a simpler but important special case where such constraints arise 
is the problem of Ending a common point within the sets Xi, i = 1,..., m; see Section 5.2). In particular, under 
relatively mild conditions, problem (1.2) with X = CiiLiXi is equivalent to the unconstrained minimization of 

r m 

f{x) + c'^P[gj{x)) +7 E dist(a;; Xi), 

j=l i=l 

where dist(a:; Xi) — min^gXi Hy — *11 and 7 is a sufficiently large penalty parameter. We discuss this possibility 
in Section 5.2. 
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Example 1.4: (Minimization of an Expected Value - Stochastic Programming) 

Consider the minimization of an expected value 

minimize 
subject to 

where If is a function of x and a random variable w taking a finite but very large number of values Wi, 
i = 1,... ,m, with corresponding probabilities ni. Here the cost function can be written as the sum of the m 
functions 'KiH{x,Wi). 

An example is stochastic programming, a classical model of two-stage optimization under uncertainty, 
where a vector x G X is selected at cost C{x), a random event occurs that has m possible outcomes wi,, Wm, 
and then another vector y is selected from some set Y with knowledge of the outcome that occurred. Then the 
optimal decision problem is to specify a vector yt £Y for each outcome Wi, and to minimize over x and yi the 
expected cost 

m 

C[x) + '^iTiGi(yi), 

i=l 

where Gi{yi) is the cost associated with the occurrence of Wi and TTi is the corresponding probability. This is 
a problem with an additive cost function. 

Additive cost function problems also arise from problem (1.4) in a different way, when the expected value 
i?|if(a;,ui)} is approximated by an m-sample average 


E{H{x,w)) 
X £ X, 


(1.4) 


E(a;) = — H(x,Wi), 

m ^' 

i — l 

where Wi are independent samples of the random variable w. The minimum of the sample average f(x) is then 
taken as an approximation of the minimum of E^JI(x,w)j. 

Example 1.5: (Weber Problem in Location Theory) 

A basic problem in location theory is to find a point x in the plane whose sum of weighted distances from a 
given set of points yi,..., pm is minimized. Mathematically, the problem is 


minimize Wi 11 ® — yi 11 

i=l 

subject to a; G K", 

where wi,..., Wm are given positive scalars. This problem descends from the famous Fermat-Torricelli-Viviani 
problem (see [BMS99] for an account of the history). The algorithmic approaches of the present paper would 
be of potential interest when the number of points m is large. We refer to Drezner and Hamacher [DrH04] for 
a survey of recent research, and to Beck and Teboulle [BeTlO] for a discussion that is relevant to our context. 

The structure of the additive cost function (1.1) often facilitates the use of a distributed computing 
system that is well-suited for the incremental approach. The following is an illustrative example. 
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Example 1.6: (Distributed Incremental Optimization Sensor Networks) 

Consider a network of m sensors where data are collected and are used to solve some inference problem involving 
a parameter vector x. If fi{x) represents an error penalty for the data collected by the ith sensor, the inference 
problem is of the form (1.1). While it is possible to collect all the data at a fusion center where the problem 
will be solved in centralized manner, it may be preferable to adopt a distributed approach in order to save 
in data communication overhead and/or take advantage of parallelism in computation. In such an approach 
the current iterate Xk is passed on from one sensor to another, with each sensor i performing an incremental 
iteration involving just its local component function fi, and the entire cost function need not be known at any 
one location. We refer to Blatt, Hero, and Gauchman [BHG08], and Rabbat and Nowak [RaN04], [RaN05] for 
further discussion. 

The approach of computing incrementally the values and subgradients of the components fi in a dis¬ 
tributed manner can be substantially extended to apply to general systems of asynchronous distributed compu¬ 
tation, where the components are processed at the nodes of a computing network, and the results are suitably 
combined, as discussed by Nedic, Bertsekas, and Borkar [NBBOl]. The analysis here relies on ideas from dis¬ 
tributed asynchronous gradient methods (both deterministic and stochastic), which were developed in the early 
80s by the author and his coworkers [Ber83], [TBA86], [BeT89]), and have been experiencing a resurgence 
recently (see e.g., Nedic and Ozdaglar [NeO09]). 

1.2 Incremental Gradient Methods - Differentiable Problems 


Let us consider first the case where the components fi are differentiable (not necessarily convex). Then, we 
may use incremental gradient methods, which have the form 

Xk+i = Px{xk - afcV/ 4 (xfc)), (1.5) 

where ak is a positive stepsize, Px{-) denotes projection on X, and ik is the index of the cost component 
that is iterated on. Such methods have a long history, particularly for the unconstrained case {X = K"), 
starting with the Widrow-Hoff least mean squares (LMS) method [WiH60] for positive semidefinite quadratic 
component functions (see e.g., [Luo91], [BeT96], Section 3.2.5, [Ber99], Section 1.5.2). They have also been 
used extensively for the training of neural networks, a case of nonquadratic/nonconvex cost components, 
under the generic name “backpropagation methods.” There are several variants of these methods, which 
differ in the stepsize selection scheme, and the order in which components are taken up for iteration (it could 
be deterministic or randomized). They are supported by convergence analyses under various conditions; see 
Luo [Luo91], Grippo [Gri93], [GriOO], Luo and Tseng [LuT94], Mangasarian and Solodov [MaS94], Bertsekas 
[Ber97], Solodov [Sol98], Tseng [Tse98]. 

When comparing the incremental gradient method with its classical nonincremental gradient counter¬ 
part [to = 1 and all components lumped into a single function F(x) = fiix)], there are two comple¬ 

mentary performance issues to consider: 

(a) Progress when far from convergence. Here the incremental method can be much faster. For an extreme 
case let X = 3?" (no constraints), and take to very large and all components fi identical to each other. 
Then an incremental iteration requires to times less computation than a classical gradient iteration, 
but gives exactly the same result, when the stepsize is appropriately scaled to be to times larger. While 
this is an extreme example, it reflects the essential mechanism by which incremental methods can be 
far superior: when the components fi are not too dissimilar, far from the minimum a single component 
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gradient will point to “more or less” the right direction [see also the discussion of [Ber97], and [Ber99] 
(Example 1.5.5 and Exercise 1.5.5)]. 

(b) Progress when elose to eonvergence. Here the incremental method is generally inferior. As we will 
discuss shortly, it converges at a sublinear rate because it requires a diminishing stepsize ak , compared 
with the typically linear rate achieved with the classical gradient method when a small constant stepsize 
is used (afc = a). One may use a constant stepsize with the incremental method, and indeed this may 
be the preferred mode of implementation, but then the method typically oscillates in the neighborhood 
of a solution, with size of oscillation roughly proportional to a, as examples and theoretical analysis 
show. 

To understand the convergence mechanism of incremental gradient methods, let us consider the case 
A = 5R", and assume that the component functions fi are selected for iteration according to a cyclic order 
[i.e., ik = {k modulo m) + 1], and let us assume that au is constant within a cycle (i.e., for all i = 0,1,.. 
aim = CK£m+i = • • • = a£m+m-i)- Then, viewing the iteration (1.5) in terms of cycles, we have for every k 
that marks the beginning of a cycle (ik = 1), 


Xk+m = Xk- aky^yfi(xk+i-i) = Xk- ak{'^F(xk) + Ck), (1.6) 

where F is the cost function/sum of components, F(x) = fiix), and Ck is given by 

m 

^ ^ iy fi(xk') ^ f i(xk+i — l)') 1 

i=l 

and may be viewed as an error in the calculation of the gradient 'Vf(xk). For Lipschitz continuous gradient 
functions V/i, the error Ck is proportional to ak, and this shows two fundamental properties of incremental 
gradient methods, which hold generally for the other incremental methods of this paper as well: 

(a) A constant stepsize (ak = a) typically cannot guarantee convergence, since then the size of the gradient 
error ||efc|| is typically bounded away from 0. Instead (in the case of differentiable components fi) a 
peculiar form of convergence takes place for constant but sufficiently small a, whereby the iterates 
within cycles converge but to different points within a sequence of m points (i.e., the sequence of first 
points in the cycles converges to a different limit than the sequence of second points in the cycles, etc). 
This is true even in the most favorable case of a linear least squares problem (see Luo [Luo91], or the 
textbook analysis of [Ber99], Section 1.5.1). 

(b) A diminishing stepsize [such as ak = 0(1/k)] leads to diminishing error Ck, so (under the appropriate 
Lipschitz condition) it can result in convergence to a stationary point of /. 

A corollary of these properties is that the price for achieving convergence is the slow (sublinear) 
asymptotic rate of convergence associated with a diminishing stepsize, which compares unfavorably with the 
often linear rate of convergence associated with a constant stepsize and the nonincremental gradient method. 
However, in practical terms this argument does not tell the entire story, since the incremental gradient method 
often achieves in the early iterations a much faster convergence rate than its nonincremental counterpart. In 
practice, the incremental method is usually operated with a stepsize that is either constant or is gradually 
reduced up to a positive value, which is small enough so that the resulting asymptotic oscillation is of no 
essential concern. An alternative, is to use a constant stepsize throughout, but reduce over time the degree 
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of incrementalism, so that ultimately the method becomes nonincremental and achieves a linear convergence 
rate (see [Ber97], [Sol98]). 

Aside from extensions to nonidifferentiable cost problems, for X = IR", there is an important variant 
of the incremental gradient method that involves extrapolation along the direction of the difference of the 
preceding two iterates: 

Xk+i =Xk- ctkXfi^ (xk) + P{xk - Xk-i), (1.7) 

where /3 is a scalar in [0,1) and x-i = xq (see e.g., [MaS94], [Tse98], [Ber96], Section 3.2). This is sometimes 
called incremental gradient method with momentum. The nonincremental version of this method is the heavy 
hall method of Polyak [Pol64], which can be shown to have faster convergence rate than the corresponding 
gradient method (see [Pol87], Section 3.2.1). A nonincremental method of this type, but with variable 
and suitably chosen value of /3, has been proposed by Nesterov [Nes83], and has received a lot of attention 
recently because it has optimal iteration complexity properties under certain conditions (see Nesterov [Nes04], 
[Nes05], Lu, Monteiro, and Yuan [LMY08], Tseng [Tse08], Beck and Teboulle [BeT09], [BeTlO]). However, 
no incremental analogs of this method with favorable complexity properties are currently known. 

Another variant of the incremental gradient method for the case AT = 5R” has been proposed by Blatt, 
Hero, and Gauchman [BHG08], which (after the first m iterates are computed) has the form 

m—1 

Xk+i = Xk - fi^_^{xk-i) (1.8) 

fco 

[for k < m, the summation should go up to £ = k, and a should be replaced by a corresponding larger 
value, such as ak = ma/(k + 1)]. This method also computes the gradient incrementally, one component 
per iteration, but in place of the single component gradient Xfif_(xk) in Eq. (1.5), it uses an approximation 
to the total cost gradient Vf{xk), which is an aggregate of the component gradients computed in the 
past m iterations. A cyclic order of component function selection [ik = (k modulo m) + 1] is assumed in 
[BHG08], and a convergence analysis is given, including a linear convergence rate result for a sufficiently 
small constant stepsize a and quadratic component functions fi. It is not clear how iterations (1.5) and 
(1.8) compare in terms of rate of convergence, although the latter seems likely to make faster progress when 
close to convergence. Note that iteration (1.8) bears similarity to the incremental gradient iteration with 
momentum (1.7) where /3 « 1. In particular, when ak = a, the sequence generated by Eq. (1.7) satisfies 

k 

Xk+1= Xk-fi^_^{xk-t) (1.9) 

fco 

[both iterations (1.8) and (1.9) involve different types of diminishing dependence on past gradient compo¬ 
nents]. There are no known analogs of iterations (1.7) and (1.8) for nondifferentiable cost problems. 

Among alternative incremental methods for differentiable cost problems, let us also mention versions 
of the Gauss-Newton method for nonlinear least squares problems, based on the extended Kalman filter 
(Davidon [Dav76], Bertsekas [Ber96], and Moriyama, Yamashita, and Fukushima [MYF03]). They are 
mathematically equivalent to the ordinary Gauss-Newton method for linear least squares, which they solve 
exactly after a single pass through the component functions fi, but they often perform much faster than the 
latter in the nonlinear case, particularly when m is large. 

Let us hnally note that incremental gradient methods are also related to stochastic gradient methods, 
which aim to minimize an expected value E\^H{x,w)} (cf. Example 1.2) by using the iteration 

Xk+i = Xk- akXH{xk,Wk), 
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where Wk is a sample of the random variable w. These methods also have a long history (see Polyak and 
Tsypkin [PoT73], Ljung [Lju77], Kushner and Clark [KuC78], Tsitsiklis, Bertsekas, and Athans [TBA86], 
Polyak [Pol87], Bertsekas and Tsitsiklis [BeT89], [BeT96], [BeTOO], Gaivoronskii [Gai93], Pfiug [Pfl96], 
Kushner and Yin [KuY97], Bottou [BotOS], Meyn [Mey07], Borkar [Bor08], Nemirovski et. al [NJL09], Lee 
and Wright [LeWlO]), and are strongly connected with stochastic approximation algorithms. The main 
difference between stochastic and deterministic formulations is that the former involve sequential sampling 
of cost components from an infinite population under some statistical assumptions, while in the latter the 
set of cost components is predetermined and finite. However, it is possible to view the incremental gradient 
method (1.5), with a randomized selection of the component function fi (i.e., with ik chosen to be any one 
of the indexes 1,..., m, with equal probability 1 /to), as a stochastic gradient method (see [BeT96], Example 
4.4, [BeTOO], Section 5). 

The stochastic formulation of incremental methods just discussed highlights an important application 
context where the component functions fi are not given a priori, but rather become known sequentially 
through some observation process. Then it often makes sense to use an incremental method to process the 
component functions as they become available, and to obtain approximate solutions as early as possible. 
In fact this may be essential in time-sensitive and possibly time-varying environments, where solutions are 
needed “on-line.” In such cases, one may hope than an adequate estimate of the optimal solution will be 
obtained, before all the functions fi are processed for the first time. 

1.3 Incremental Subgradient Methods - NondifFerentiable Problems 

We now discuss the case where the component functions fi are convex and nondifferentiable at some points, 
and consider incremental subgradient methods. These are similar to their gradient counterparts (1.5) except 
that an arbitrary subgradient ^fif.(xk) of the cost component fif. is used in place of the gradient:! 

Xk+i = Px{xk - ctkV fif^ixk)). (1-10) 

Such methods were first proposed in the general form (1.10) in the Soviet Union by Kibardin [Kib80], 
following the earlier paper by Litvakov [Lit66] (which considered convex/nondifferentiable extensions of linear 
least squares problems) and other related subsequent proposals.]: These works remained unnoticed in the 
Western literature, where incremental methods were reinvented often in different contexts and with different 
lines of analysis; see Solodov and Zavriev [SoZ98], Bertsekas [Ber99] (Section 6.3.2), Ben-Tal, Margalit, 
and Nemirovski [BMNOl], Nedic and Bertsekas [NeBOO], [NeBOl], [NeBlO], Nedic, Bertsekas, and Borkar 
[NBBOl], Kiwiel [Kiw04], Rabbat and Nowak [RaN04], [RaN05], Gaudioso, Giallombardo, and Miglionico 
[GGM06], Shalev-Shwartz et. al. [SSS07], Helou and De Pierro [HeD09], Johansson, Rabi, and Johansson 

] In this paper, we use S/ fix) to denote a subgradient of a convex function / at a vector x, i.e, a vector such that 
/(•*) > fix) + fix)'iz — x) for all x £ 5ft". The choice of V/(x) from within the set of all subgradients at x [the 
subdifferential at x, denoted dfix)] will be clear from the context. Note that if / is real-valued, dfix) is nonempty 
and compact for all x. If / is differentiable at x, dfix) consists of a single element, the gradient V fix). 

] Generally, in the 60s and 70s, algorithmic ideas relating to simple gradient methods with and without determin¬ 
istic and stochastic errors were popular in the Soviet scientific community, partly due to an emphasis on stochastic 
iterative algorithms, such as pseudogradient and stochastic approximation; the works of Ermoliev, Polyak, and Tsyp¬ 
kin, to name a few of the principal contributors, are representative [Erm69], [PoT73], [Erm76], [Pol78], [Pol87]. By 
contrast the emphasis in the Western literature at the time was in more complex Newton-like and conjugate direction 
methods. 



[JRJ09], Predd, Kulkarni, and Poor [PKP09], and Ram, Nedic, Veeravalli [RNV09], [RNV09], and Duchi, 
Hazan, and Singer [DHSIO]. 

Incremental subgradient methods have convergence characteristics that are similar in many ways to 
their gradient counterparts, the most important similarity being the necessity for a diminishing stepsize ak 
for convergence. The lines of analysis, however, tend to be different, since incremental gradient methods rely 
for convergence on arguments based on decrease of the cost function value, while incremental subgradient 
methods rely on arguments based on decrease of the iterates’ distance to the optimal solution set. The line of 
analysis of the present paper is of the latter type, similar to earlier works of the author and his collaborators 
(see [NeBOO], [NeBOl], [NBBOl], and the textbook presentations in [Ber99], [BNO03]). 

Note two important ramifications of the lack of differentiability of the component functions ff. 

(1) Convexity of fi becomes essential, since the notion of subgradient is connected with convexity (subgra- 
dient-like algorithms for nondifferentiable/nonconvex problems have been suggested in the literature, 
but tend to be complicated and have not found much application thus far). 

(2) There is more reason to favor the incremental over the nonincremental methods, since (contrary to 
the differentiable case) nonincremental subgradient methods also require a diminishing stepsize for 
convergence, and typically achieve a sublinear rate of convergence. Thus the one theoretical advantage 
of the nonincremental gradient method discussed earlier is not shared by its subgradient counterpart. 

Let us finally mention that just as in the differentiable case, there is a substantial literature for stochastic 
versions of subgradient methods. In fact, as we will discuss in this paper, there is a potentially significant 
advantage in turning the method into a stochastic one by randomizing the order of selection of the components 
fi for iteration. 

1.4 Incremental Proximal Methods 


We now consider an extension of the incremental approach to proximal algorithms. The simplest one for 
problem (1.1) is of the form 


Xk+i = arg min 

xGX 


/*fc(a;) + -—||x-Xfc||2 

2ak 


( 1 . 11 ) 


which relates to the proximal minimization algorithm (Martinet [Mar70], Rockafellar [Roc76]) in the same 
way that the incremental subgradient method (1.10) relates to the classical nonincremental subgradient 
method.t Here {ofe} is a positive scalar sequence, and we will assume that each /i : 5R" i-> is a convex 
function, and X is a nonempty closed convex set. The motivation for this type of method, which was proposed 
only recently in [BerlO], is that with a favorable structure of the components, the proximal iteration (1.10) 
may be obtained in closed form or be relatively simple, in which case it may be preferable to a gradient or 
subgradient iteration. In this connection, we note that generally, proximal iterations are considered more 
stable than gradient iterations; for example in the nonincremental case, they converge essentially for any 
choice of ak, while this is not so for gradient methods. 

f In this paper we restrict attention to proximal methods with the quadratic regularization term ||® — Xk\\'^■ Our 
approach is applicable in principle when a nonquadratic term is used instead in order to match the structure of the 
given problem. The discussion of such alternative algorithms is beyond our scope. 
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Unfortunately, while some cost function components may be well suited for a proximal iteration, others 
may not be because the minimization (1.11) is inconvenient, and this leads us to consider combinations of 
gradient/subgradient and proximal iterations. In fact this has motivated in the past nonincremental combi¬ 
nations of gradient and proximal methods for minimizing the sum of two functions (or more generally, finding 
a zero of the sum of two nonlinear operators). These methods have a long history, dating to the splitting 
algorithms of Lions and Mercier [LiM79], Passty [Pas79], and Spingarn [Spi85], and have become popular 
recently (see Beck and Teboulle [BeT09], [BeTlO], and the references they give to specialized algorithms, 
such as shrinkage/thresholding, cf. Section 5.1). Let us also note that splitting methods are related to alter¬ 
nating direction methods of multipliers (see Gabay and Mercier [GaM76], [Gab83], Bertsekas and Tsitsiklis 
[BeT89], Eckstein and Bertsekas [EcB92]), which are presently experiencing a revival as viable (nonincre¬ 
mental) methods for minimizing sums of component functions (see the survey by Boyd et. al. [BPGIO], which 
contains extensive references to recent work and applications, and the complexity-oriented work of Goldfarb, 
Ma, and Scheinberg [GoM09], [GMSIO]). 

With similar motivation in mind, we adopt in this paper a unified algorithmic framework that includes 
incremental gradient, subgradient, and proximal methods, and their combinations, and serves to highlight 
their common structure and behavior. We focus on problems of the form 


m 

minimize F{x)'^= Fj (x) 

i=l 

subject to X G X, 


( 1 . 12 ) 


where for all i, 

Fi{x) = fi{x) + hi{x), (1-13) 

/i : 9?” !->■ 5R and hi : K” i-s- 5R are real-valued convex functions, and X is a nonempty closed convex set. 

In Section 2, we consider several incremental algorithms that iterate on the components fi with a 
proximal iteration, and on the components hi with a subgradient iteration. By choosing all the fi or all 
the hi to be identically zero, we obtain as special cases the subgradient and proximal iterations (l.IO) and 
(l.II), respectively. However, our methods offer greater flexibility, and may exploit the special structure of 
problems where the functions fi are suitable for a proximal iteration, while the components hi are not and 
thus may be preferably treated with a subgradient iteration. 

In Section 3, we discuss the convergence and rate of convergence properties of methods that use a cyclic 
rule for component selection, while in Section 4, we discuss the case of a randomized component selection 
rule. In summary, the convergence behavior of our incremental methods is similar to the one outlined earlier 
for the incremental subgradient method (1.10). This includes convergence within a certain error bound for 
a constant stepsize, exact convergence to an optimal solution for an appropriately diminishing stepsize, and 
improved convergence rate/iteration complexity when randomization is used to select the cost component 
for iteration. In Section 5 we illustrate our methods for some example applications. 


2. INCREMENTAL SUBGRADIENT-PROXIMAL METHODS 


In this section, we consider problem (1.12)-(1.13), and introduce several incremental algorithms that involve 
a combination of a proximal and a subgradient iteration. One of our algorithms has the form 


Zk = arg min 

x^X 





( 2 . 1 ) 
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^k+1 — Pxi^^k ^ ( 2 . 2 ) 

where Vhi^{zk) is an arbitrary subgradient of ft-ij, at Zk- Note that the iteration is well-defined because 
the minimum in Eq. (2.1) is uniquely attained since fi is continuous and ||x — Xfcjp is real-valued, strictly 
convex, and coercive, while the subdifferential dhi{zk) is nonempty since hi is real-valued. Note also that 
by choosing all the ft or all the hi to be identically zero, we obtain as special cases the subgradient and 
proximal iterations (1-10) and (1.11), respectively. 

The iterations (2.1) and (2.2) maintain both sequences {zk} and {xk} within the constraint set X, but 
it may be convenient to relax this constraint for either the proximal or the subgradient iteration, thereby 
requiring a potentially simpler computation. This leads to the algorithm 

Zk = arg mm , (2.3) 

Xk+i = Px{zk - akXhi^{zk)), (2.4) 

where the restriction x G X has been omitted from the proximal iteration, and the algorithm 

Zk = Xk - akXhi^^{xk), (2.5) 

Xk+i = argmin I /*Jx) -h - Zk\\A , (2.6) 

xGX ZCXk J 

where the projection onto X has been omitted from the subgradient iteration. It is also possible to use 
different stepsize sequences in the proximal and subgradient iterations, but for notational simplicity we will 
not discuss this type of algorithm. 

All of the incremental proximal algorithms given above are new to our knowledge, having first been 
proposed in the author’s recent paper [BerlO] and the on-line chapter of the book [Ber09]. The closest 
connection to the existing proximal methods is the “proximal gradient” method, which has been analyzed 
and discussed recently in the context of several machine learning applications by Beck and Teboulle [BeT09], 
[BeTlO] (it can also be interpreted in terms of splitting algorithms [LiM79], [Pas79]). This method is 
nonincremental, applies to differentiable hi, and contrary to subgradient and incremental methods, it does 
not require a diminishing stepsize for convergence to the optimum. In fact, the line of convergence analysis 
of Beck and Teboulle relies on the differentiability of hi and the nonincremental character of the proximal 
gradient method, and is thus different from ours. 

Part (a) of the following proposition is a key fact about incremental proximal iterations. It shows 
that they are closely related to incremental subgradient iterations, with the only difference being that the 
subgradient is evaluated at the end point of the iteration rather than at the start point. Part (b) of the 
proposition provides an inequality that is well-known in the theory of proximal methods, and will be useful 
for our convergence analysis. In the following, we denote by ri(5') the relative interior of a convex set S, and 
by dom(/) the effective domain {x | f{x) < oo} of a function / : K" >->• (—oo, oo]. 


Proposition 2.1: Let AT be a nonempty closed convex set, and let / : JR" i-^- (—(X),oo] be a closed 
proper convex function such that ri(X) n ri(dom(/)) ^ 0. For any Xk G JR" and ak > 0, consider the 


proximal iteration 


Xk+i = argmin 

xGX 


fix) 


2ak 


\X - Xk\ 


(2.7) 
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(a) The iteration can be written as 

Xk+i = Px{xk - afcV/(a;fe+i)), i = 

where Vf(xk+i) is some subgradient of / at Xk+i- 

(b) For all t/ € X, we have 

Ikfc+i - j/||2 < \\xk - j/lp - 2ak{fixk+i) - f{y)) - \\xk - Xk+i 

< \\xk - y\V - 2ak{f(xk+i) - fiy))- 


( 2 . 8 ) 


(2.9) 


Proof: (a) We use the formula for the subdifferential of the sum of the three functions /, {l/2ak)\[X — Xk\\^, 
and the indicator function of X (cf. Prop. 5.4.6 of [Ber09]), together with the condition that 0 should belong 
to this subdifferential at the optimum Xk+i- We obtain that Eq. (2.7) holds if and only if 

— {xk - Xk+i) G df{xk+i) + Nx{xk+i), (2.10) 

Ctk 

where Nx{xk+i) is the normal cone of X at Xk+i [the set of vectors y such that y'{x — Xk+i) < 0 for all 
X € X, and also the subdifferential of the indicator function of X at Xk+i] see [Ber09], p. 185]. This is true 
if and only if 

Xk - Xk+i - akXf(xk+i) G Nxixk+i), 

for some Xf{xk+i) G df(xk+i), which in turn is true if and only if Eq. (2.8) holds, by the projection theorem, 
(b) We have 

\\xk - 1/IP = \\xk - Xk+i +Xk +1 - 1 /IP = \\xk - Xfc+ilp - 2{xk - Xk+iYiy - Xk+i) + ||xfc+i - i/|p. (2.11) 

Also since from Eq. (2.10), —(xk —Xk+i) is a subgradient at Xk+i of the sum of / and the indicator function 
of X, we have (using also the assumption y G X) 

f{xk+i) + —{xk - Xk+iYiy - Xk+i) < f{y). 

Oik 

Combining this relation with Eq. (2.11), the result follows. Q.E.D. 

Based on Prop. 2.1(a), we see that all the preceding iterations can be written in an incremental 
subgradient format: 

(a) Iteration (2.1)-(2.2) can be written as 

Zk = Px{xk - o-kXfi^{zk)), Xk+i = Px{zk - ctk'^hi^izk)). ( 2 - 12 ) 

(b) Iteration (2.3)-(2.4) can be written as 

Zk — Xk OikX fi^i^ZkY Xk-\-l — Pxi^Zk kXkX hi^i^ZkY) • (2T3) 
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(2.14) 


(c) Iteration (2.5)-(2.6) can be written as 

Zk= Xk- ak'^hi^^{xk), Xk+i = Px{zk - afeV/ij^(xfc+i)). 


Note that in all the preceding updates, the subgradient Vhij, can be any vector in the subdifferential of hi,,, 
while the subgradient ^ fif. must be a specific vector in the subdifferential of fif., specified according to Prop. 
2.1(a). Note also that iteration (2.13) can be written as 

Xk+i = Px (xk - ctk^Fif^ {zk )), 

and resembles the incremental subgradient method for minimizing over X the cost F{x) = Pii^) [cf- 

Eq. (1.12)], the only difference being that the subgradient of Fif. is taken at Zk rather than Xk- 

An important issue which affects the methods’ effectiveness is the order in which the components 
{fi, hi} are chosen for iteration. In this paper, we consider two possibilities: 

(1) A cyclic order, whereby {fi,hi} are taken up in the fixed deterministic order 1 ,...,to, so that ik is 
equal to (fc modulo m) plus 1. A contiguous block of iterations involving {/i, hi},..., {/m, hm} in this 
order and exactly once is called a cycle. We assume that the stepsize ak is constant within a cycle (for 
all k with ifc = 1 we have ak = ak+i ... = ak+m-i)- 

(2) A randomized order based on uniform sampling, whereby at each iteration a component pair {fi,hi} 
is chosen randomly by sampling over all component pairs with a uniform distribution, independently 
of the past history of the algorithm. 

It is essential to include all components in a cycle in the cyclic case, and to sample according to the uniform 
distribution in the randomized case, for otherwise some components will be sampled more often than others, 
leading to a bias in the convergence process. 

Another technique for incremental methods, popular in neural network training practice, is to reshuffle 
randomly the order of the component functions after each cycle. This alternative order selection scheme leads 
to convergence, like the preceding two. Moreover, this scheme has the nice property of allocating exactly one 
computation slot to each component in an m-slot cycle (m incremental iterations). By comparison, choosing 
components by uniform sampling allocates one computation slot to each component on the average, but 
some components may not get a slot while others may get more than one. A nonzero variance in the number 
of slots that any fixed component gets within a cycle, may be detrimental to performance, and indicates that 
reshuffling randomly the order of the component functions after each cycle works better; this is consistent 
with experimental observations shared with us by B. Recht (private communication). While it seems difficult 
to establish this fact analytically, a justification is suggested by the view of the incremental method as a 
gradient-like method that uses as descent direction the true gradient at the start of the cycle plus an “error” 
[due to the calculation of the component gradients at points intermediate within a cycle; cf. Eq. (1.6)]. The 
error has apparently greater variance in the uniform sampling method than in the randomly shuffled order 
method (in fact the variance of the error would seem relatively larger as m increases, although other factors 
such as variance of size of component gradients would also play a role). Heuristically, if the variance of the 
error is larger, the direction of descent deteriorates, suggesting slower convergence. In this paper, we will 
focus on the easier-to-analyze uniform sampling method, and show by analysis that it is superior to the 
cyclic order. 

For the remainder of the paper, we denote by F* the optimal value of problem (1.12): 

F* = inf F{x), 

x^X 
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and by X* the set of optimal solutions (which could be empty): 

X* = [x* \x* € X, F{x*) = F*}. 

Also, for a nonempty closed convex set X, we denote by dist(-; AT) the distance function given by 

dist(a;; A) = min ||a: — z||, a; S JR”. 

z^X 

In our convergence analysis of Section 4, we will use the following well-known theorem (see Neveu 
[Nev75], p. 33). We will use a much simpler deterministic version of the theorem in Section 3. 


Proposition 2.2: (Supermartingale Convergence Theorem) Let Yfc, Zk, and Wk, k = 0,1,.. 
be three sequences of random variables and let Fk, A: = 0,1,..., be sets of random variables such that 
Fk C Fk+i for all k. Suppose that: 

(1) The random variables Yk, Zk, and Wk are nonnegative, and are functions of the random variables 
in Fk- 

(2) For each k, we have 

E{Yk+i \Fk)<Yk-Zk + Wk. 

(3) There holds, with probability 1, J2T=o < oo. 

Then we have < oo, and the sequence Yk converges to a nonnegative random variable Y, 

with probability 1. 


3. CONVERGENCE FOR METHODS WITH CYCLIC ORDER 

In this section, we discuss convergence under the cyclic order. We consider a randomized order in the next 
section. We focus on the sequence {xk} rather than {zk}, which need not lie within X in the case of iterations 
(2.13) and (2.14) when X ^ JR". In summary, the idea is to show that the effect of taking subgradients of /i 
or hi at points near Xk (e.g., at Zk rather than at Xk) is inconsequential, and diminishes as the stepsize ak 
becomes smaller, as long as some subgradients relevant to the algorithms are uniformly bounded in norm by 
some constant. This is similar to the convergence mechanism of incremental gradient methods described in 
Section 1.2. We use the following assumptions throughout the present section. 


Assumption 3.1: [For iterations (2.12) and (2.13)] There is a constant c € JR such that for 
all k 

max{\\Xfi^{zk)\\,\\Vh^|^{zk)\\} < c. (3.1) 

Furthermore, for all k that mark the beginning of a cycle (i.e., all A: > 0 with ik = 1), we have 

max{fj{xk) - fj{zk+j-i), hj{xk) - hj{zk+j-i)) < c\\xk - Zk+j-i\\, V j = 1,... ,m. (3.2) 
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Assumption 3.2: [For iteration (2.14)] There is a constant c G 5R such that for all k 

max{||V/4(xfc+i)||, ||Vfoj^(a;fc)||} < c. (3.3) 

Furthermore, for all k that mark the beginning of a cycle (i.e., all fc > 0 with ik = 1), we have 

max {/^(xfc) - fj{xk+j-i), hj{xk) - hj{xk+j-i)} < c \\xk - Xk+j-iW^ V j = 1,... ,m, (3.4) 

/j(xfc+j_i) - /^(xfe+j) < c||xfc+j_i-Xfc+jll, Vj = l,...,m. (3.5) 


Note that the condition (3.2) is satisfied if for each i and k, there is a subgradient of fi at Xk and a 
subgradient of hi at Xk, whose norms are bounded by c. Conditions that imply the preceding assumptions 
are: 

(a) For algorithm (2.12): fi and hi are Lipschitz continuous over the set X. 

(b) For algorithms (2.13) and (2.14): fi and hi are Lipschitz continuous over the entire space 5ft". 

(c) For all algorithms (2.12), (2.13), and (2.14): fi and hi are polyhedral [this is a special case of (a) and 

(b)]. 

(d) For all algorithms (2.12), (2.13), and (2.14): The sequences {x^} and {zk} are bounded [since then 
fi and hi, being real-valued and convex, are Lipschitz continuous over any bounded set that contains 
{xk} and {zk}]. 

The following proposition provides a key estimate that reveals the convergence mechanism of our 
methods. 


Proposition 3.1: Let {x^} be the sequence generated by any one of the algorithms (2.12)-(2.14), 
with a cyclic order of component selection. Then for all y G A and all k that mark the beginning of 
a cycle (i.e., all k with ik = 1), we have 

\\xk+m - 2/P < ||xfc - 2/P - 2ak{F{xk) - F{y)) + alPirFc^, (3.6) 

where B = — A. 

^ m. ' 


Proof: We first prove the result for algorithms (2.12) and (2.13), and then indicate the modifications 
necessary for algorithm (2.14).f Using Prop. 2.1(b), we have for all // G A and k, 

pfc-2/P < \\xk - y\\^ - 2ak{fi^,{zk) - fi^,{y)) ■ (3.7) 

f The original version of this report gave /3 = A -(- 4 for the case of algorithms (2.12) and (2.13), and B = ^ + 4 
for the case of algorithm (2.14), because a loose bound was used in the following calculation. The tighter version 
given here was prompted by an observation by M. Andersen and P. C. Hansen in Oct. 2013. 
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Also, using the nonexpansion property of the projection [i.e., ||Px(u) — -Px(u)|| < ||u — u|| for all u,v £ IR"], 
the definition of subgradient, and Eq. (3.1), we obtain for all y S A and k, 

\\xk+i - yP = \\Px{zk - akVh^^{zk)) - y||^ 

< ||zfc-afcVhiP^fe)-y||2 

' 9 11 '” 112 

= \\zk - y|p - 2akVhi^{zky{zk - y) + afc||V/i*j^(zfe)|| 

< \\zk - y|p - 2ak{h^^^{zk) - 

Combining Eqs. (3.7) and (3.8), and using the definition Fj = fj + hj, we have 

\\xk+i -yP < IPfc -yP - 2ak{f^^,{zk) + hi^^izk) - fikiv) -h^^{y)) + alc^ 

= \\xk - yP - 2ak{F^^{zk) - F,^{y)) + alc^. 


Let now k mark the beginning of a cycle (i.e., ik = 1). Then at iteration k + j — 1, j = 1,... ,m, the 
selected components are {fj, hj}, in view of the assumed cyclic order. We may thus replicate the preceding 
inequality with k replaced byfc + l,...,fc + m — 1, and add to obtain 

m 

\\xk+m - yP < \\xk - yP - 2ak {Fj{zk+j-i) - Fj{y)) + malc^, 

i=i 

or equivalently, since F = Fj, 

m 

\\xk+m - yP < \\xk - yP - 2ak{F{xk) - F{y)) + malc^ + 2ak'Y “ Fj{zk+j-i)). (3.10) 

j=i 

The remainder of the proof deals with appropriately bounding the last term above. 

From Eq. (3.2), we have for j = 1,..., m, 

FjijXk) Fji^Zk-\-j — \) ^ 2c Ip/,; Z/;;_|_J_1 II . (3.1l) 

We also have 

\\xk Zk-jj — 1 II ^ \\xk Xk-jl || T * ' * “t“ |pfc+j —2 Xk-jj — 1 || “t“ |pfc+j —1 Zk-jj — 1 1|, (^*^^) 

and by the definition of the algorithms (2.12) and (2.13), the nonexpansion property of the projection, and 
Eq. (3.1), each of the terms in the right-hand side above is bounded by 2akC, except for the last, which is 
bounded by akC. Thus Eq. (3.12) yields ||x/; — Zk+j-i\\ < oik{2j — l)c, which together with Eq. (3.11), shows 
that 

Fj{xk) - Fj{zk+j-i) < 2akc‘^{2j - 1). (3.13) 

Combining Eqs. (3.10) and (3.13), we have 

m 

Ipfc+m - yP < Ipfc - yP - 2ak{F{xk) - F{y)) + -h 4a|c2 ^(2j - 1), 


and finally 

Ipfc+m - yP < Ipfc - yP - 2ak{F{xk) - F{y)) + 
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which is of the form (3.6) with /3 = ^ + 4. 

For the algorithm (2.14), a similar argument goes through using Assumption 3.2. In place of Eq. (3.7), 
using the nonexpansion property of the projection, the definition of subgradient, and Eq. (3.3), we obtain 
for all y € A and fc > 0, 

\\zk - y|p < \\xk - 2 /IP - 2ak{hi^{xk) - hi^{y)) + alc^, (3-14) 

while in place of Eq. (3.8), using Prop. 2.1(b), we have 

\\xk+i - y|p < \\zk-yP - 2ak{f^^:ixk+l) - hkiv))- (3-15) 

Combining these equations, in analogy with Eq. (3.9), we obtain 

Ikfc+i -2/lp < \\xk - yP - 2ak{f^k{xk+l) + h^^{xk) - f^k{y) - + alc^ 

. s / X (3.16) 

= llxfc - y|p - 2ak{FiPxk) - Fi^y)) + + 2ak{fiPxk) - fi^Xk+i))- 


As earlier, we let k mark the beginning of a cycle (i.e., ik = 1). We replicate the preceding inequality 
with k replaced byfc + l,...,fc + m — 1, and add to obtain [in analogy with Eq. (3.10)] 


||a;fe+m - y|p < ||a:fc - y|p - 2ak{F{xk) - F{y)) + 

m m 

+ 2ak ^ ( {Fj(xk) — Fj{xk+j-i)) + 2ak ^ ( {fj{xk+j-i) ~ fj{xk+j))- 

i=i i=i 


(3.17) 


We now bound the two sums in Eq. (3.17), using Assumption 3.2. From Eq. (3.4), we have 

Fj {xk) Fj {xk+j — l ) ^ 2c||x/j; Xk-\-j — l II ^ 2c^||xfc Xk-\-l II 4“ * * * “t“ ||x/^-|-j_2 Xk-kj—l ||) j 

and since by Eq. (3.3) and the definition of the algorithm, each of the norm terms in the right-hand side 
above is bounded by 2akC, 

Fj{xk) - Fj{xk+j-i) < ^akC^ij - 1). 

Also from Eqs. (3.3) and (3.5), and the nonexpansion property of the projection, we have 

fj^Xk-kj — l) fj{xk-\-j) ^ C ||xfc-|_j_l Xfc-|_j|[ ^ 2cXkC‘^- 


Combining the preceding relations and adding, we obtain 


mm m 

2ak (Fjixk) - Fj{xk+j-i)) + 2ak Y (fjPk+j-i) - fjixk+j)) < ^(j - 1) + Aalc^m 


i=i 


i=i 

= 4a^c2(m2 — m) + 4:a\c^m 



i=i 


which together with Eq. (3.17), yields Eq. (3.6). Q.E.D. 


Among other things, Prop. 3.1 guarantees that with a cyclic order, given the iterate Xk at the start of 
a cycle and any point y € X having lower cost than Xk (for example an optimal point), the algorithm yields 
a point Xk+m at the end of the cycle that will be closer to y than Xk, provided the stepsize ak is less than 

2{Fixk) - F{y)) 
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In particular, for any e > 0 and assuming that there exists an optimal solution a;*, either we are within 
+ e of the optimal value, 




or else the squared distance to x* will be strictly decreased by at least 2ake, 


\\xk+m - x*\\^ < \\xk - a;*|p - 2afee. 


Thus, using Prop. 3.1, we can provide various types of convergence results. As an example, for a constant 
stepsize {ak = a), convergence can be established to a neighborhood of the optimum, which shrinks to 0 as 
q; —>■ 0, as stated in the following proposition. 


Proposition 3.2: Let {a:fc} be the sequence generated by any one of the algorithms (2.12)-(2.14), 
with a cyclic order of component selection, and let the stepsize ak be hxed at some positive constant 

a. 

(a) If F* = —oo, then 

liminf F(xk) = F*. 

(b) If F* > —CO, then 

k—¥oo 

limM F{xk) < F* + , 


k—¥oo 2 

where c and P are 

the constants of Prop. 3.1. 


Proof: We prove (a) and (b) simultaneously. If the result does not hold, there must exist an e > 0 such 
that 

liminf F{xkm) - ^ - 2e> F*. 

k—^oo Z 

Let y € X he such that 

limmfF(a;fcm) - - 2e > F{y), 

k—¥oo Z 

and let ko be large enough so that for all k > ko, we have 

F{xkm) > liminfF(xfcm) - £• 

k—¥oo 

By combining the preceding two relations, we obtain for all k > ko, 

FM-FW)>^^ + e. 

Using Prop. 3.1 for the case where y = y together with the above relation, we obtain for all k > ko, 
l|a;(fc+i)m - 2/P < \\xkm - yP - 2a[F{xkm) - F{y)) + Pa^rrFc^ < \\xkm - yP - 2ae. 
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This relation implies that for all k > ko, 

||a:(fc+i)m - y|p < ||a;(fc-i)m - y|p - 4Q;e < • • • < \\xkQ - yP - 2(fc + 1 - fco)ae, 
which cannot hold for k sufficiently large - a contradiction. Q.E.D. 

The next proposition gives an estimate of the number of iterations needed to guarantee a given level 
of optimality up to the threshold tolerance al3w?c^ /2 of the preceding proposition. 


Proposition 3.3: Assume that X* is nonempty. Let {xk} be a sequence generated as 

in Prop. 3.2. 

Then for e > 0, we have 



mi„ + 

o<fc<Ar ^ - 2 

(3.18) 

where N is given by 

dist(a:o; A:*)2 

N = m - . 

(3.19) 


L ae J 


Proof: Assume, to arrive at a contradiction, that Eq. (3.18) does not hold, so that for all k with 0 < km < 
N, we have 

F(n„} > F. + 

By using this relation in Prop. 3.1 with at replaced by a and y equal to the vector of X* that is at minimum 
distance from Xkm, we obtain for all k with 0 < km < N, 

dist(a;(fe+i)^; A*)2 < dist(a;fcm; A:*)2 -2a(F{xkm) - F*)+a^Pm?c^ 

< dist(a;fcm; X*Y — + ae) + a^Pm^c^ 

= dist(a;fcm; A*)2 - ae. 

Adding the above inequalities for A: = 0,...,—, we obtain 


so that 


which contradicts the definition of N. Q.E.D. 


dist(a;Ar+m; Ar*)2 < dist(a:o; Ar*)2 — ( — + 1 j ae, 


+ 1 ae < dist(xo; A*)2, 


According to Prop. 3.3, to achieve a cost function value within 0(e) of the optimal, the term aPm?c^ 
must also be of order 0{e), so a must be of order 0(elm?c^), and from Eq. (3.19), the number of necessary 
iterations N is 0(m?(? je^), and the number of necessary cycles is 0((mc)2/e^)). This is the same type of 
estimate as for the nonincremental subgradient method [i.e., 0(l/e^), counting a cycle as one iteration of 
the nonincremental method, and viewing me as a Lipschitz constant for the entire cost function E], and 
does not reveal any advantage for the incremental methods given here. However, in the next section, we 
demonstrate a much more favorable iteration complexity estimate for the incremental methods that use a 
randomized order of component selection. 
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Exact Convergence for a Diminishing Stepsize 

We can also obtain an exact convergence result for the case where the stepsize ak diminishes to zero. The 
idea is that with a constant stepsize a we can get to within an 0(Q;)-neighborhood of the optimum, as shown 
above, so with a diminishing stepsize Ofc, we should be able to reach an arbitrarily small neighborhood of the 
optimum. However, for this to happen, should not be reduced too fast, and should satisfy ctk = oo 

(so that the method can “travel” infinitely far if necessary). 


Proposition 3.4: Let {xk} be the sequence generated by any one of the algorithms (2.12)-(2.14), 
with a cyclic order of component selection, and let the stepsize ak satisfy 

OO 

lim Ofc = 0, ttfe = OO. 


Then, 


Furthermore, if X* is nonempty and 


lim inf F{xk) = F*. 

k—¥oo 

OO 

'^al<oo, 

k=0 


then {xk} converges to some x* G X*. 


Proof: For the first part, it will be sufficient to show that liminffc_i.oo F(xkm) = F*. Assume, to arrive at 
a contradiction, that there exists an e > 0 such that 

lim inf F(xkm) — 2e > F *. 

k—¥co 

Then there exists a point y G X such that 

liminfF(xfcm) - 2e > F{y). 

k—¥GO 

Let ko be large enough so that for all k > ko, we have 

F{xkra) > liminfF(xfcm) - £• 

k—¥OC 

By combining the preceding two relations, we obtain for all k > ko, 

F{xkra) - F{y) > e. 

By setting y = y in Prop. 3.1, and by using the above relation, we have for all k > ko, 

||a;(fc+i)m - y|p < \\xkm - yP - 2akme + = \\xkm - yP - Ofem (2e - (SakmrrFc ^). 

Since —>■ 0, without loss of generality, we may assume that ko is large enough so that 

2e — pakrrFc^ > e, \f k > ko. 
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Therefore for all k > ko, we have 


k 

||a;(fe+l)m - y|p < \\xkfn - y|p - afemC < ' ' ' < \\Xkom “ £ X! 

i=kQ 


which cannot hold for k sufficiently large. Hence liminffc^oo F{xkm) = F*. 

To prove the second part of the proposition, note that from Prop. 3.1, for every x* € -X* and fc > 0 we 

have 

||a;(fc+i)m - x*\\‘^ < \\xkm - x*\\'^ - 2akm{F{xkm) - F{x*)) + (3.20) 

The Supermartingale Convergence Theorem (Prop. 2.2)f and the hypothesis ^ imply that 

{Ipfcm —2^*11} converges for every x* G X* . Since then {xkm} is bounded, it has a limit point x € X that 
satisfies 

F{x) = liminf F{xkm) = F*. 

k—¥oc> 

This implies that x G X*, so it follows that — a;||} converges, and that the entire sequence {xkm} 

converges to x (since a; is a limit point of {xkm})- 

Finally, to show that the entire sequence {xk} also converges to x, note that from Eqs. (3.1) and (3.3), 
and the form of the iterations (2.12)-(2.14), we have ||a;fc+i — a;fc|| < 2afcC —>■ 0. Since {xkm\ converges to x, 
it follows that {a;^} also converges to a;. Q.E.D. 


4. CONVERGENCE FOR METHODS WITH RANDOMIZED ORDER 


In this section, we discuss convergence for the randomized component selection order and a constant stepsize 
a. The randomized versions of iterations (2.12), (2.13), and (2.14), are 


Zk = Px {xk - aVfuj^izk)), Xk+i = Px{zk - aXhu,^{zk)), (4.1) 

Zk = Xk - aXfu^{zk)j Xk+i = Px{zk - aXhu,^{zk)), (4.2) 

Zk = Px{xk - aXhuj^^izk)), Xk+i = Zk- aXfu,^{xk+i), (4.3) 

respectively, where {ujk} is a sequence of random variables, taking values from the index set {1,..., m}. 

We assume the following throughout the present section. 


f Actually we use here a deterministic version/special case of the theorem, where Yk, Zk, and Wk are nonnegative 
scalar sequences satisfying Yfc+i <Yk — Zk + Wk with ^ Then the sequence Yk must converge. This 

version is given with proof in many sources, including [BeT96] (Lemma 3.4), and [BeTOO] (Lemma 1). 
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Assumption 4.1: [For iterations (4.1) and (4.2)] 

(a) {wfe} is a sequence of random variables, each uniformly distributed over {1,... ,m}, and such 
that for each k, uJk is independent of the past history {xk, Zk-i,Xk-i,..., zo, a;o}- 

(b) There is a constant c € U such that for all k, we have with probability 1 

max{||V/*(4)||, ||V/ii(4)||} < c, (4.4) 

max {fi{xk) - fi{zl), h^{xk) - hi{zl)} < c\\xk - zlW, Vz = l,...,m, (4.5) 

where 4 is the result of the proximal iteration, starting at Xk if Wfe would be z, i.e., 

zl = argmin |/i(a;) + - a^fclpj , 

in the case of iteration (4.1), and 

4 = argmin|/,(x) + ^||x-x,p|, 

in the case of iteration (4.2). 

Assumption 4.2: [For iteration (4.3)] 

(a) {uJk} is a sequence of random variables, each uniformly distributed over {1,... ,m}, and such 
that for each k, Uk is independent of the past history {xk, Zk-i,Xk-i,..., zo, xq}. 

(b) There is a constant c € 3? such that for all k, we have with probability 1 

max{||V/i(x*j,_^J]|, ]|V4(xfc)||} < c, Vz = l,...,m, (4.6) 

Hxk) - < c\\xk - xl^J, Vz = l,...,m, (4.7) 

where x\j^-^ is the result of the iteration, starting at Xk if Wfc would be z, i.e., 

x(,+i = Fx(4 - afcV/4x(,+i)), 

with 

zl=Xk- akVhi{xk). 

Note that condition (4.5) is satisfied if there exist subgradients of fi and hi at Xk with norms less or 
equal to c. Thus the conditions (4.4) and (4.5) are similar, the main difference being that the first applies 
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to “slopes” of fi and hi at zl while the second applies to the “slopes” of fi and hi at Xk- As in the case 
of Assumption 3.1, these conditions are guaranteed by Lipschitz continuity assumptions on fi and hi. The 
convergence analysis of the randomized algorithms of this section is somewhat more complicated than the 
one of the cyclic order counterparts, and relies on the Supermartingale Convergence Theorem. The following 
proposition deals with the case of a constant stepsize, and parallels Prop. 3.2 for the cyclic order case. 


Proposition 4.1: Let {xk} be the sequence generated by one of the randomized incremental 

methods (4.1)-(4.3), and let the stepsize au be fixed at some positive constant a. 

(a) If F* = — oo, then with probability 1 

inf F{xk) = F*. 


(b) If F* > — oo, then with probability 1 


inf F{xk) < F* + 
k>0 


aj3mc^ 

2 



where (3 = 5. 


Proof: Consider first algorithms (4.1) and (4.2). By adapting the proof argument of Prop. 3.1 with Fi^ 
replaced by F^j^, [cf. Eq. (3.9)], we have 

\\xk+i - yW"^ < \\xk - yW"^-2a{Fu;^{zk) - Fuj^iy)) + a'^c^, y y G X, k>0. 


By taking the conditional expectation with respect to Fk = {xk,Zk-i,... ,zo,xo}, and using the fact that 
Wfc takes the values i = 1,... ,m with equal probability 1/m, we obtain for s\\ y G X and k, 

F{||a;fc+i - ylp I Fk} < \\xk - 2 /|p - 2aE{F^^,{zk) - Fi^^,{y) \ Fk) + 

„ m 

= \W - y\? - - F{y)) + 

(4.8) 

= \W - 2 /P - —{F{xk) - F{y)) + —Y^{F{xk) - F{zl)) + 

i—l 

By using Eqs. (4.4) and (4.5), 

m mm 

- MzD) < 2c^ \\xk - 411 = 2ca^ ||V/i(4)|| < 2maF. 

i—l i—l i—l 

By combining the preceding two relations, we obtain 


E{\\xk+i - 2/P I Fk} < \\xk - 2/P - — {F{xk) - F{y)) + + a‘^c^ 

2n 

= IPfe - 2/P - - F{y)) + (3a‘^c^, 


(4.9) 
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where /3 = 5. 

The preceding equation holds also for algorithm (4.3). To see this note that Eq. (3.16) yields for all 

y&x 


Ikfc+I - 2/IP < \\xk -y\V - 2a(TL;fc(xfc) - F^^{y)) +a^c^ + 2a{f^^{xk) - fu,k{xk+i)), 

and similar to Eq. (4.8), we obtain 

^ rv ^ 

E{\\xk+i -J/Ip I Tk) < \\xk-y\\^ - —{F{xk) - F{y)) + —^{fi{xk) - /»(4+i)) 

i—1 

Erom Eq. (4.7), we have 

Mxk) - /i(4+i) ^ c||xfc - 4+1II, 

and from Eq. (4.6) and the nonexpansion property of the projection, 

Ikfc - 4+1II - INfc “ 4 + “^/*(^fc+i)|| = \\^k - Xk+ aVhi{xk) + Q;V/i(x4i)|| < 2 q;c. 

Combining the preceding inequalities, we obtain Eq. (4.9) with (3 = 5. 

Let us fix a positive scalar 7 , consider the level set defined by 

„2 ' 


(4.10) 


(4.11) 


.Zv'*y - 


and let ^ X he such that 


{xeX I E(4 < -7 + 1 + ^^^} ifF* = -oo, 
{x e X I F(4 < E* + I + iiF*>-oo, 


-7 if F* = — 00 , 

F{yi) = i + 1 if > _oo. 


Note that y-y G Ly, by construction. Define a new process {xk} that is identical to {xk}, except that once Xk 
enters the level set Lj, the process terminates with Xk = 2 / 7 - We will now argue that for any fixed 7 , {5:^} 
(and hence also {xfc}) will eventually enter Ly, which will prove both parts (a) and (b). 

Using Eq. (4.9) with y = ?/y, we have 


2a 


from which 


where 


E{\\xk+i - y-i\\^ I Fk) < \\xk - y 7 |p- {E{xk) - E{yj)) + j3a^c^ 


E{||xfe+i - yyip I Fk) < ll^fe - y7|p - Ufc, 


+ ^ f ^(J^(4)-E(yy))-/3a2c2 if^^Ey, 

I 0 if Xfc = yy. 


(4.12) 


The idea of the subsequent argument is to show that as long as Xk ^ Ty, the scalar Vk (which is a measure 
of progress) is strictly positive and bounded away from 0 . 


(a) Let F* = — 00 . Then if Xk we have 


2a 

Vk = — (E(4) - F(yy)) - /3a2c2 

m 


> 


2a 

m 

2a 


-J+1 + + 7 ) - / 3 a 2 c 2 
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Since Vk = 0 for Xk € L-y, we have Vk > 0 for all k, and by Eq. (4.12) and the Supermartingale Convergence 
Theorem (cf. Prop. 2.2), ^ °° implying that Xk G for sufficiently large /c, with probability 1. 

Therefore, in the original process we have 


inf F{xk) < 




a/Smc^ 

2 


with probability 1. Letting 7 00 , we obtain inffc>o F{xk) = —00 with probability 1. 

(b) Let F* > — 00 . Then if Xk ^ Lj, we have 

2a 

Vk = — {F{xk) - F{yy)) - Pa^c^ 

m 

2a f 2 aPmc^ 1 \ 

> — ( F* H-1--- F* -I — Pa^c^ 

m \ 7 2 7 / 

2a 

mj 

Hence, Vk > 0 for all k, and by the Supermartingale Convergence Theorem, we have X^o Vk < co implying 
that Xk G Lj for sufficiently large k, so that in the original process, 

p , , 2 aPmc^ 

mf F{xk) <F*-\ - 1 --— 

fe>o 72 


with probability 1. Letting 7 —>■ 00 , we obtain inffc>o F{xk) < F* + apmc^/2. Q.E.D. 


By comparing Prop. 4.1(b) with Prop. 3.2(b), we see that when F* > —00 and the stepsize a is 
constant, the randomized methods (4.1), (4.2), and (4.3), have a better error bound (by a factor m) than 
their nonrandomized counterparts. In fact an example given in p. 514 of [BNO03] for the incremental 
subgradient method can be adapted to show that the bound of Prop. 3.2(b) is tight in the sense that for a 
bad problem/cyclic order we have liminffc_,,oo F{xk) — F* = 0{am'^c^). By contrast the randomized method 
will get to within 0{amc^) with probability 1 for any problem, according to Prop. 4.1(b). Thus with the 
randomized algorithm we do not run the risk of choosing by accident a bad cyclic order. A related result 
is provided by the following proposition, which should be compared with Prop. 3.3 for the nonrandomized 
methods. 


Proposition 4.2: Assume that X* is nonempty. Let {xk} be a sequence generated 
Then for any positive scalar e, we have with probability 1 

as in Prop. 4.1. 

N ^ aPmc^ + e 

mm F{xk) < F* ---, 

(4.13) 

0<k<N 2 

where A^ is a random variable with 


nTAri dist(xo;X*)2 

E{N} <m - — - 

(4.14) 

ae 
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Proof: Let y be some fixed vector in X*. Define a new process {xk} which is identical to {xk} except that 
once Xk enters the level set 


L=\xeX 


F{x) < F* + 


aPmc^ + e 


the process {xk} terminates at y. Similar to the proof of Prop. 4.1 [cf. Eq. (4.9) with y being the closest 
point of Xk in X*], for the process {xfc} we obtain for all k, 


E{dist{xk+i; X*)^ \ Fk} <E{\\xk+i-y\\^\Xk} 

Ory 

< distixk] X*)'^ - (F{xk) — E*) + jda^c^ (4-15) 

m 

= dist(xfc;X*)2 - vk, 


where Fk = {xk,Zk-i, ■ ■ ■ ,Zo,xo} and 


Vk 


^{F{xk)-F*)-Pd^c^ ifxfc^L, 

0 otherwise. 


In the case where Xk ^ L, we have 


2a 

Vk> — 
m 



a(3mc^ + e 
2 



/3a‘^c^ 


ae 

m 


(4.16) 


By the Supermartingale Convergence Theorem (cf. Prop. 2.2), from Eq. (4.15) we have 


< oo 

k=0 


with probability 1, so that Vk = 0 for all k > N, where IV is a random variable. Hence xn & L with 
probability 1, implying that in the original process we have 


min F(xk) < E* + 

0<k<N 


ajdmc^ + e 
2 


with probability 1. Furthermore, by taking the total expectation in Eq. (4.15), we obtain for all k, 


E{dist(a;fc+i;X*)2} < E{dist(a;fe;X*)^} — E{vk} < dist(i;o;X*)^ — E 



where in the last inequality we use the facts xq = xo and E{dist(a;o;X*)^} = dist(a;o;X*)^. Therefore, 
letting k oo, and using the definition of Vk and Eq. (4.16), 


dist(a;o;X*)2 > E 



Q.E.D. 


Like Prop. 4.1, a comparison of Props. 3.3 and 4.2 again suggests an advantage for the randomized 
methods: compared to their deterministic counterparts, they achieve a much smaller error tolerance (a 
factor of m), in the same expected number of iterations. Note, however, that the preceding assessment is 
based on upper bound estimates, which may not be sharp on a given problem [although the bound of Prop. 
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3.2(b) is tight with a worst-case problem selection as mentioned earlier; see [BNO03], p. 514]. Moreover, the 
comparison based on worst-case values versus expected values may not be strictly valid. In particular, while 
Prop. 3.3 provides an upper bound estimate on N, Prop. 4.2 provides an upper bound estimate on E{N}, 
which is not quite the same. 

Finally for the case of a diminishing stepsize, let us give the following proposition, which parallels Prop. 
3.4 for the cyclic order. 


Proposition 4.3: Let {xk} be the sequence generated by one of the randomized incremental 

methods (4.1)-(4.3), and let the stepsize au satisfy 

OO 

lim Ofc = 0, 7 Ofc = OO. 

k—¥oo ' ^ 

Then, with probability 1, 

lim inf F{xk) = F*. 

k—¥oo 

Furthermore, if X* is nonempty and 

OO 

'^al<oo, 

k=0 

then {xk} converges to some x* G X* with probability 1. 


Proof: The proof of the first part is nearly identical to the corresponding part of Prop. 3.4. To prove the 
second part, similar to the proof of Prop. 4.1, we obtain for all k and all x* G X*, 

F{\\xk+i-x*W^\Fk) < \\xk-x*\\^-‘^{F{xk)-F*)+Palc^ (4.17) 

[cf. Eq. (4.9) with a and y replaced with ak and x*, respectively], where Fk = {xk, Zk-i, ■ ■ ■, Zo,xo}- By the 
Supermartingale Convergence Theorem (Prop. 2.2), for each x* G X*, there is a set fix* of sample paths of 
probability 1 such that for each sample path in fix* 

V] — (^(xfe) - F*) < OO, (4.18) 

k=0 

and the sequence {\\xk — x*||} converges. 

Let {vi} be a countable subset of the relative interior ri(X*) that is dense in X* [such a set exists since 
ri(X*) is a relatively open subset of the affine hull of X*; an example of such a set is the intersection of X* 
with the set of vectors of the form x* + ^*5*) where ^i,..., are basis vectors for the affine hull of X* 

and Ti are rational numbers]. Let also fivi be the set of sample paths defined earlier that corresponds to Vi. 
The intersection 

fi = 

has probability 1, since its complement 0,^ is equal to U“;^r2S. and 

OO 

Prob ^ Prob (flgj = 0. 
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For each sample path in O, all the sequences {Hxfe — ni||} converge so that {xk} is bounded, while by 
the first part of the proposition [or Eq. (4.18)] liminffc_>oo = F*■ Therefore, {xk} has a limit point 

X in X*. Since {vi} is dense in X*, for every e > 0 there exists such that jja; — < e. Since the 

sequence {\\xk — i^i(e)||} converges and a; is a limit point of {xk}, we have limf,_,.oo ||a;fc — Wi(e)|| < e, so that 

limsup lixfc - S|| < lim \\xk - Vi(X\ + \\v^u) - x\\ < 2e. 

fc->oo fc^oo 

By taking e ^ 0, it follows that Xk —>■ x. Q.E.D. 


5. SOME APPLICATIONS 

In this section we illustrate our methods in the context of two types of practical applications, and discuss 
relations with known algorithms. 

5.1 Regularized Least Squares 


Let us consider least squares problems, involving minimization of a sum of quadratic component functions 
fi{x) that correspond to errors between data and the output of a model that is parameterized by a vector 
X. Often a convex regularization function R{x) is added to the least squares objective, to induce desirable 
properties of the solution. This gives rise to problems of the form 


minimize 


Fix) + ^'^{cix - d^)^ 

^ i-1 


subject to X € 


(5.1) 


where Ci and di are given vectors and scalars, respectively, and 7 is a positive scalar. When R is differentiable 
(e.g., quadratic), and either m is very large or the data (ci,di) become available sequentially over time, it 
makes sense to consider incremental gradient methods, which have a long history of applications over the 
last 50 years, starting with the Widrow-Hoff least mean squares (LMS) method [WiHGOj. 

The classical type of regularization involves a quadratic function R (as in classical regression and the 
LMS method), but nondifferentiable regularization functions have become increasingly important recently. 
On the other hand, to apply our incremental methods, a quadratic R is not essential. What is important 
is that R has a simple form that facilitates the use of proximal algorithms, such as for example a separable 
form, so that the proximal iteration on R is simplified through decomposition. As an example, consider the 
£i-regularization problem, where 

n 

Fix) =7\\x\\i=-f'^\xt\, (5.2) 

i=i 

7 is a positive scalar and x^ is the jth coordinate of x. Then the proximal iteration 

decomposes into the n one-dimensional minimizations 

zi = arg min | 7 |xl| -b “ xi\^ \ , j = 1 ,... ,n, 

xteSR ( 2ak j 
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and can be done in closed form 

{ ^k ~ if lock < xl, 

0 a-jak < xl < jak, J = l, (5.3) 

xi+jctk 

We refer to Figueiredo, Nowak, and Wright [FNW07], Wright, Nowak, and Figueiredo [WNF08], Beck 
and Teboulle [BeTlO], and the references given there, for a discussion of a broad variety of applications in 
estimation and signal processing problems, where nondifferentiable regularization functions play an important 
role. 

We now note that the incremental algorithms of this paper are well-suited for solution of £i-regularization 
problems of the form (5.1)-(5.2). For example, the /cth incremental iteration may consist of selecting a data 
pair {ci^,di^) and performing a proximal iteration of the form (5.3) to obtain Zk, followed by a gradient 
iteration on the component starting at Zk'- 

Xk-\-l ~ ^k 

This algorithm is the special case of the algorithms (2.12)-(2.14) (here X = JR", and all three algorithms 
coincide), with fi{x) being 7 ||a;||i (we use m copies of this function) and hi{x) = ^(c'x — diY- It can be 
viewed as an incremental version of a popular class of algorithms in signal processing, known as iterative 
shrinkage/thresholding (see Chambolle et. al. [CDL98], Figueiredo and Nowak [FiN03], Daubechies, Defrise, 
and Mol [DDM04], Combettes and Wajs [CoW05], Bioucas-Dias and Figueiredo [BiF07], Elad, Matalon, and 
Zibulevsky [EMZ07], Beck and Teboulle [BeT09], [BeTlO]). Our methods bear the same relation to this class 
of algorithms as the LMS method bears to gradient algorithms for the classical linear least squares problem 
with quadratic regularization function. 

Finally, let us note that as an alternative, the proximal iteration (5.3) could be replaced by a proximal 
iteration on 7 |x^ | for some selected index j, with all indexes selected cyclically in incremental iterations. 
Randomized selection of the data pair (ci^,diY would also be interesting, particularly in contexts where the 
data has a natural stochastic interpretation. 

5.2 Iterated Projection Algorithms 


A feasibility problem that arises in many contexts involves finding a point with certain properties within a 
set intersection where each Xi is a closed convex set. For the case where m is large and each of the 

sets Xi has a simple form, incremental methods that make successive projections on the component sets Xi 
have a long history (see e.g., Gubin, Polyak, and Raik [GPR67], and recent papers such as Bauschke [BauOl], 
Bauschke, Gombettes, and Kruk [BCL06], and Gegielski and Suchocka [CeS08], and their bibliographies). 
We may consider the following generalized version of the classical feasibility problem. 


minimize f{x) 
subject to a; G 

where / : JR" M- JR is a convex cost function, and the method 


(5.4) 


= Pxi^ {xk - akXf{xk)), (5.5) 

where the index ik is chosen from {!,... ,m} according to a randomized rule. Incremental algorithms for 
problem (5.4), which bear some relation with ours have been recently proposed by Nedic [NedlOj. Actually, 
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the algorithm of [NedlO] involves an additional projection on a special set Xq at each iteration, but for 
simplicity we will take Xq = JR". The incremental approach is particularly well-suited for problems of the 
form (5.4) where the sets Xi are not known in advance, but are revealed as the algorithm progresses. 

While the problem (5.4) does not involve a sum of component functions, it may be converted into one 
that does by using an exact penalty function. In particular, consider the problem 


m 

minimize f{x) + 7 dist(a;; Xi) 

i=l 

subject to a: G JR", 


(5.6) 


where 7 is a positive penalty parameter. Then for / Lipschitz continuous and 7 sufficiently large, problems 
(5.4) and (5.6) are equivalent. We show this for the case where to = 1 and then we generalize. 


Proposition 5.1: Let f : Y 1 —>■ JR be a function defined on a subset Y of JR", and let X be a 
nonempty closed subset of Y. Assume that / is Lipschitz continuous over Y with constant L, i.e., 

\fix) - fiy)\ < L\\x-y\\, yx,y€Y, 


and let 7 be a scalar with 7 > L. Then the set of minima of / over X coincides with the set of minima 
of 


f{x) -I- 7 dist(x; A) 


over Y. 


Proof: Denote F{x) = f{x) + 7 dist(a;; A). For a vector x G Y, let x denote a vector of A that is at 
minimum distance from A. If 7 > L, we have using the Lipschitz property of /, 

F{x) = f{x) + 7 ||a; - x|| = f{x) + (/(x) - f{x)) + 7 ||a; - s|| > /(x) = F{x), M x GY, 

with strict inequality ii x ^ x. Hence the minima of F over Y can only lie within A, while F = f within A. 
This shows that if 7 > L, then x* minimizes / over A if and only if x* minimizes F over Y. Q.E.D. 

We now provide a generalization for to > I. 


Proposition 5.2: Let f '-Y i-j. JR be a function defined on a subset Y of JR", and let Aj, t = I,..., to, 
be closed subsets of Y with nonempty intersection. Assume that / is Lipschitz continuous over Y. 
Then there is a scalar 7 > 0 such that for all 7 > 7 , the set of minima of / over n”bj^Ai coincides with 
the set of minima of 

m 

f{x) -t- 7 ^dist(a;; Ai) 

i=l 

over Y. 
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Proof: Let L be the Lipschitz constant for /, and let 71 ,..., 7 ^ be scalars satisfying 

7fc>L + 7iH-l-7fe-i> V/c = 1,... ,m, 

where 70 = 0. Define 

F^{x) = f{x) + 71 dist(x; Xi) H-h 7 fe dist(x; X^), k = l,...,m, 

and for fc = 0, denote F^{x) = f{x), 70 = 0. By applying Prop. 5.1, the set of minima of F™ over Y 
coincides with the set of minima of over Xm-, since 7 m is greater than L + 71 + • • • + 7 m- 1 , the 

Lipschitz constant for Similarly, for all fc = l,...,m, the set of minima of F* over 

coincides with the set of minima of F^~^ over Thus, for A: = 1, we obtain that the set of minima of 

/ + 7 ^™ ^ dist(-; Xi) = F™ over Y coincides with the set of minima of / = F^ over Q.E.D. 


Note that while the penalty parameter thresholds derived in the preceding proof are quite large, lower 
thresholds may hold under additional assumptions, such as for convex / and polyhedral Xi. Regarding 
algorithmic solution, from Prop. 5.2, it follows that we may consider in place of the original problem (5.4) 
the additive cost problem (5.6) for which our algorithms apply. In particular, let us consider the algorithms 
(2.12)-(2.14), with X = 5R", which involve a proximal iteration on one of the functions 7 dist(a;;Xi) followed 
by a subgradient iteration on /. A key fact here is that the proximal iteration 


Zk = arg min 


7 dist(x; X^^^ ) + -— \\x - Xk\ 
Zak 


(5.7) 


involves a projection on Xi^ of Xk, followed by an interpolation. This is shown in the following proposition. 


Proposition 5.3: Let Zk be the vector produced by the proximal iteration (5.7). If Xk G Xi^, then 
Zk = Xk, while if Xk ^ 


r (I - ^k)xk + PkPxi^ (xk) if 13k < 1, 
\Pxi^{xk) if/3fe>I, 


where 


f3k 


Oikl 

dist(xfc; Aij.) 


Proof: The case Xk G Xi^^ is evident, so assume that Xk ^ Xi,,. From the nature of the cost function in 
Eq. (5.7) we see that Zk is a vector that lies in the line segment between Xk and Fx-^ (xk). Hence there are 
two possibilities: either 

Zk = Pxi^^ixk), (5.9) 

or Zk ^ Xif^ in which case by setting to 0 the gradient at Zk of the cost function in Eq. (5.7) yields 


7 


Zk - PXi^ (zk) 
Zk - PXi^ {zk) 


{xk Zk) • 
Oik 
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This equation implies that Xk, Zk, and Pxi^ {zk) lie on the same line, so that Pxi^ (zk) = Pxi^^ (xk) and 


Zk=Xk- ^ = (1 “ Pk)Xk + PkPXi^ (xk)- (5.10) 

By calculating and comparing the value of the cost function in Eq. (5.7) for each of the possibilities (5.9) 
and (5.10), we can verify that (5.10) gives a lower cost if and only if Pk < 1. Q.E.D. 

Let us now consider the problem 


minimize E (Mx) + h^{x)) 
subject to X € 

Based on the preceding analysis, we can convert this problem to the unconstrained minimization problem 


minimize E {fi{x) + hi{x) + 7 dist(a;; Xi)) 

i=l 

subject to a; € K”, 

where 7 is sufficiently large. The algorithm (2.14), applied to this problem, yields the iteration 


Vk — Xk Oik'^ hi^ixk^ ^ Zk — Uk ^k^ fik{zk')^ Xk-\-l — 


{1 -/3k)zk + PkPxii^izk) if/3fc<l, 
Pxi^izk) if/3fc>l. 


where 


Pk 


otkl 

dist{zk; Xi,^) ’ 


[cf. Eq. (5.8)]. The index ik may be chosen either randomly or according to a cyclic rule. 
Let us finally note another problem where our incremental methods apply: 


minimize f{x) + c E max{ 0 ,gj(x)} 
1=1 


subject to a; G 


This type of problem is obtained by replacing convex inequality constraints of the form gj{x) < 0 with 
the nondifferentiable penalty terms cmax |0, ( 7 j(a;)}, where c > 0 is a penalty parameter. Then a possible 
incremental method at each iteration, would either do a subgradient or proximal iteration on /, or select one 
of the violated constraints (if any) and perform a subgradient iteration on the corresponding function gj , or 
select one of the sets Xi and do an interpolated projection on it. Related methods may also be obtained 
when / is replaced by a cost function of the form 


{Mx) + hiix)), 


and the components fi are dealt with a proximal iteration while the components hi are dealt with a subgra¬ 
dient iteration. 
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6. CONCLUSIONS 

We have surveyed incremental algorithms, which can deal with many of the challenges posed by large data 
sets in machine learning applications, as well as with the additive structure of many interesting problems, 
including those arising in the context of duality. We have used a unified analytical framework that includes 
incremental proximal algorithms and their combinations with the more established incremental gradient 
and subgradient methods. This allows the flexibility to separate the cost function into the parts that are 
conveniently handled by proximal iterations (e.g., in essentially closed form), and the remaining parts to 
be handled by subgradient iterations. We have outlined the convergence properties of these methods, and 
we have shown that our algorithms apply to some important problems that have been the focus of recent 
research. 

Much work remains to be done to apply and evaluate our methods within the broad context of potential 
applications. Let us mention some possibilities that may extend the range of applications of our approach, and 
are interesting subjects for further investigation: alternative proximal and projected subgradient iterations, 
involving nonquadratic proximal terms and/or subgradient projections, alternative stepsize rules, distributed 
asynchronous implementations along the lines of [NBBOl], polyhedral approximation (bundle) variants of 
the proximal iterations in the spirit of [BeY09], and variants for methods with errors in the calculation of 
the subgradients along the lines of [NeBlO]. 
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